Sampling the Repairs of Functional Dependency Violations under Hard Constraints

Authors

  • George Beskales
  • Ihab F. Ilyas
  • Lukasz Golab
Abstract

Violations of functional dependencies (FDs) are common in practice, often arising in the context of data integration or Web data extraction. Resolving these violations is known to be challenging for a variety of reasons, one of them being the exponential number of possible “repairs”. Previous work has tackled this problem either by producing a single repair that is (nearly) optimal with respect to some metric, or by computing consistent answers to selected classes of queries without explicitly generating the repairs. In this paper, we propose a novel data cleaning approach that is not limited to finding a single repair or to a particular class of queries, namely, sampling from the space of possible repairs. We give several motivating scenarios where sampling from the space of FD repairs is desirable, propose a new class of useful repairs, and present an algorithm that randomly samples from this space. We also show how to restrict the space of generated repairs based on user-defined hard constraints that define an immutable trusted subset of the input relation, and we experimentally evaluate our algorithm against previous approaches. While this paper focuses on repairing FDs, we envision the proposed sampling approach to be applicable to other integrity constraints with large repair spaces.
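
To make the notions of an FD violation and an exponentially large repair space concrete, the following is a minimal Python sketch. It is not the sampling algorithm proposed in the paper, which the abstract does not describe. It assumes a deliberately naive repair strategy: only right-hand-side values are changed, each group of tuples that agree on the left-hand side is unified to one of the values already present in that group, and trusted rows (a simplified stand-in for the hard constraints) are never modified. All function names, the `trusted` parameter, and the example data are hypothetical.

```python
import random
from collections import defaultdict
from math import prod

# Hypothetical example relation; the intended FD is zip -> city.
EXAMPLE_ROWS = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicgo"},
    {"zip": "94105", "city": "San Francisco"},
]


def group_by_lhs(rows, lhs):
    """Group row indices by their values on the FD's left-hand side."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in lhs)].append(i)
    return groups


def violating_groups(rows, lhs, rhs):
    """Return the groups that agree on lhs but disagree on rhs."""
    return {
        key: idxs
        for key, idxs in group_by_lhs(rows, lhs).items()
        if len({rows[i][rhs] for i in idxs}) > 1
    }


def candidate_values(rows, idxs, rhs, trusted):
    """RHS values a group may be unified to under the naive strategy."""
    trusted_vals = {rows[i][rhs] for i in idxs if i in trusted}
    if len(trusted_vals) > 1:
        raise ValueError("trusted rows contradict each other")
    if trusted_vals:
        # A trusted row fixes the value for its whole group.
        return sorted(trusted_vals)
    return sorted({rows[i][rhs] for i in idxs})


def count_naive_repairs(rows, lhs, rhs, trusted=frozenset()):
    """Number of naive repairs: product of per-group candidate counts."""
    groups = violating_groups(rows, lhs, rhs)
    return prod(
        len(candidate_values(rows, idxs, rhs, trusted))
        for idxs in groups.values()
    )


def sample_naive_repair(rows, lhs, rhs, trusted=frozenset(), rng=None):
    """Draw one naive repair, choosing each group's value uniformly."""
    rng = rng or random.Random()
    repaired = [dict(r) for r in rows]  # leave the input untouched
    for idxs in violating_groups(rows, lhs, rhs).values():
        value = rng.choice(candidate_values(rows, idxs, rhs, trusted))
        for i in idxs:
            repaired[i][rhs] = value
    return repaired
```

On the example relation, two zip codes each carry two conflicting city names, so `count_naive_repairs(EXAMPLE_ROWS, ["zip"], "city")` gives 4; marking row 0 as trusted halves that to 2. Even under this restricted strategy the count multiplies with every additional violating group, which is the growth that makes enumerating all repairs impractical and sampling attractive.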

Similar resources

Query Answering over Functional Dependency Repairs

Inconsistency often arises in real-world databases and, as a result, critical queries over dirty data may lead users to make ill-informed decisions. Functional dependencies (FDs) can be used to specify intended semantics of the underlying data and aid with the cleaning task. Enumerating and evaluating all the possible repairs to FD violations is infeasible, while approaches that produce a singl...

Semandaq: a data quality system based on conditional functional dependencies

We present SEMANDAQ, a prototype system for improving the quality of relational data. Based on the recently proposed conditional functional dependencies (CFDs), it detects and repairs errors and inconsistencies that emerge as violations of these constraints. We demonstrate the following functionalities supported by SEMANDAQ: (a) an interface for specifying CFDs; (b) a visual tool for automated ...

Combining case-based reasoning with tabu search for personnel rostering problems

In this paper we investigate the advantages of using Case-Based Reasoning (CBR) to solve personnel rostering problems. Constraints for personnel rostering problems are commonly categorised as either ‘hard’ or ‘soft’. Hard constraints are those which must be satisfied and a roster which violates none of these constraints is considered to be ‘feasible’. Soft constraints are more flexible and are ...

Enhancing case-based reasoning for personnel rostering with selected tabu search concepts

In this paper we investigate the advantages of using Case-Based Reasoning (CBR) to solve personnel rostering problems. Constraints for personnel rostering problems are commonly categorised as either ‘hard’ or ‘soft’. Hard constraints are those which must be satisfied and a roster which violates none of these constraints is considered to be ‘feasible’. Soft constraints are more flexible and are ...

Pattern-Driven Data Cleaning

Data is inherently dirty and there has been a sustained effort to come up with different approaches to clean it. A large class of data repair algorithms rely on data-quality rules and integrity constraints to detect and repair the data. A well-studied class of integrity constraints is Functional Dependencies (FDs, for short) that specify dependencies among attributes in a relation. In this pape...

Journal title:
  • PVLDB

Volume 3  Issue

Pages  -

Publication date 2010